Project Summary

Introduction

This notebook analyzes the daily time series of Citi Bike rides in NYC in the post-Covid period, from March 1, 2022 to March 31, 2024. The goal is to forecast NYC bike demand to help Citi Bike optimize bike distribution and maintenance planning.

Datasets

  1. Citi Bike Trip Data: https://citibikenyc.com/system-data
  2. NYC Weather Data: https://mesonet.agron.iastate.edu/request/download.phtml?network=NY_ASOS

The final dataset includes the bike demand variable “number_of_rides” and numerical weather variables, which I included to help predict bike demand in some of the models below.

Note: I aggregated 3.3 million bike ride records over two years into a daily time series and combined it with daily-averaged weather data. These pre-processing steps are omitted for simplicity.

Project Sections

  1. Daily Time Series EDA

Investigated the characteristics of the time series, such as variance, trend, seasonality, stationarity, etc.

  2. Model Selection & Model Estimation & Model Diagnostics

Experimented with 6 types of models, optimized model parameters, and conducted model diagnostics based on several criteria, such as AICc and residuals.

Model 1: SARIMA (single seasonality)

Model 2: Linear Regression (weather predictors)

Model 3: Regression with ARIMA Errors (weather predictors)

Model 4: Dynamic Harmonic Regression (multi-seasonality)

Model 5: TBATS (complex seasonality)

Model 6: Intervention Analysis (no intervention effect from Covid-19)

  3. Model Evaluation of Forecast Accuracy via Cross-Validation

Selected the Dynamic Harmonic Regression model as the best model due to the multi-seasonal nature of the data and evaluated its forecast accuracy via cross-validation using various error metrics.

  4. Future Forecast by Dynamic Harmonic Regression

Used the best Dynamic Harmonic Regression model to forecast the next 30 days (April 2024) and compared forecast values against actual Citi Bike rides.

  5. Future Work

Author & Platform

Yezi Liu conducted this project independently using RStudio.

1. Daily Time Series EDA

final_average_df <- read.csv("final_daily_nyc_df.csv")
daily_ts <- ts(final_average_df$number_of_rides, start = c(2022, 60), frequency = 365.25)

The meaning of frequency:

Frequency = 365.25: Forecasting functions treat the data as daily observations with an annual cycle, the fractional period accounting for leap years over long-horizon predictions. This affects how trend and seasonality are forecast.

Frequency = 1: The data are treated as purely sequential with no seasonal or cyclic pattern, so forecasts will not model any periodic changes.
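
The effect of the declared frequency can be sketched with base R alone (the values below are illustrative, not the project data):

```r
x <- c(57432, 68718, 56795, 51059, 52695, 55948)

# Daily series with an annual cycle, starting on day 60 of 2022 (March 1st):
daily <- ts(x, start = c(2022, 60), frequency = 365.25)
# The same values treated as plain sequential observations:
sequential <- ts(x, frequency = 1)

frequency(daily)       # 365.25: seasonal models will look for an annual period
frequency(sequential)  # 1: no periodic structure is assumed
```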

print(head(daily_ts))
## [1] 57432 68718 56795 51059 52695 55948
print(length(daily_ts))
## [1] 762
plot(daily_ts, xlab = "Time", ylab = "Number of Rides", main = "Daily Bike Rides from March 2022 to March 2024")

Box-Cox Transformation

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
lambda <- BoxCox.lambda(daily_ts)
daily_ts_boxcox <- BoxCox(daily_ts, lambda)

plot(daily_ts_boxcox, main = paste("Box-Cox Transformed Time Series Data, Lambda =", round(lambda, 2)), xlab = "Time", ylab = "Transformed Bike Rides")

lambda = 0.37: since the estimated lambda is well below 1, a Box-Cox transformation is needed to stabilize the variance.
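
The transform itself is simple. A minimal base-R sketch of the (y^lambda - 1) / lambda form that BoxCox() applies to positive data when lambda != 0, with illustrative values:

```r
# Box-Cox transform for lambda != 0: w = (y^lambda - 1) / lambda
box_cox     <- function(y, lambda) (y^lambda - 1) / lambda
inv_box_cox <- function(w, lambda) (lambda * w + 1)^(1 / lambda)

y <- c(57432, 68718, 56795)
w <- box_cox(y, 0.37)
inv_box_cox(w, 0.37)  # recovers y: the transform is invertible
```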

ACF & PACF

library(forecast)

Acf(daily_ts, main="ACF for Daily Demand of Bike Rides")

Pacf(daily_ts, main="PACF for Daily Demand of Bike Rides")

Observations:

  1. Non-stationary
  2. Annual seasonality in the long term
  3. Potential weekly seasonality in the short term (analyzed later)

Decompose Method 1

bike_decompose <- decompose(daily_ts)
plot(bike_decompose)

Decompose Method 2

library(forecast) 
bike_decompose <- stl(daily_ts, s.window="periodic")
plot(bike_decompose)

plot(bike_decompose$time.series[, "trend"], main="Trend Component")

plot(bike_decompose$time.series[, "seasonal"], main="Seasonal Component")

plot(bike_decompose$time.series[, "remainder"], main="Residual Component")

KPSS Test to check stationarity (trend)

library(tseries)
kpss_test_original <- kpss.test(daily_ts)
print("KPSS Test for Original Data:")
## [1] "KPSS Test for Original Data:"
print(kpss_test_original)
## 
##  KPSS Test for Level Stationarity
## 
## data:  daily_ts
## KPSS Level = 0.63784, Truncation lag parameter = 6, p-value = 0.0192

p-value = 0.02 < 0.05: I reject the null hypothesis that the time series is level-stationary, indicating non-stationarity driven by the increasing trend (trend differencing with d = 1 is needed).

ADF Test to check stationarity

library(tseries)
adf_test <- adf.test(daily_ts, alternative = "stationary")
adf_test$p.value
## [1] 0.07014074

The larger p-value (0.07 > 0.05) means I fail to reject the ADF null hypothesis of a unit root, so I conclude that the series is non-stationary.

Seasonal Differencing

Weekly Seasonality Testing

library(tseries)
daily_ts_seasonal_diff1 <- diff(daily_ts, lag=7)
kpss_test <- kpss.test(daily_ts_seasonal_diff1)
## Warning in kpss.test(daily_ts_seasonal_diff1): p-value greater than printed
## p-value
print("KPSS Test for Weekly Seasonality:")
## [1] "KPSS Test for Weekly Seasonality:"
print(kpss_test)
## 
##  KPSS Test for Level Stationarity
## 
## data:  daily_ts_seasonal_diff1
## KPSS Level = 0.046564, Truncation lag parameter = 6, p-value = 0.1
adf_test <- adf.test(daily_ts_seasonal_diff1, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff1, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Weekly Seasonality:")
## [1] "ADF Test P-value for Weekly Seasonality:"
adf_test$p.value
## [1] 0.01

The time series is stationary after weekly seasonal differencing.
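
Seasonal differencing at lag 7 simply subtracts the observation from one week earlier; a toy base-R illustration:

```r
# diff(x, lag = 7) computes z_t = x_t - x_{t-7}, removing a weekly pattern
x <- 1:10
diff(x, lag = 7)  # 8-1, 9-2, 10-3 -> 7 7 7
```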

Monthly Seasonality Testing

library(tseries)
daily_ts_seasonal_diff2 <- diff(daily_ts, lag=28)
kpss_test <- kpss.test(daily_ts_seasonal_diff2)
print("KPSS Test for Monthly Seasonality:")
## [1] "KPSS Test for Monthly Seasonality:"
print(kpss_test)
## 
##  KPSS Test for Level Stationarity
## 
## data:  daily_ts_seasonal_diff2
## KPSS Level = 0.40645, Truncation lag parameter = 6, p-value = 0.07437
adf_test <- adf.test(daily_ts_seasonal_diff2, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff2, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Monthly Seasonality:")
## [1] "ADF Test P-value for Monthly Seasonality:"
adf_test$p.value
## [1] 0.01

The time series is stationary after monthly seasonal differencing.

Annual Seasonality Testing

library(tseries)
daily_ts_seasonal_diff3 <- diff(daily_ts, lag=365)
kpss_test <- kpss.test(daily_ts_seasonal_diff3)
## Warning in kpss.test(daily_ts_seasonal_diff3): p-value greater than printed
## p-value
print("KPSS Test for Annual Seasonality:")
## [1] "KPSS Test for Annual Seasonality:"
print(kpss_test)
## 
##  KPSS Test for Level Stationarity
## 
## data:  daily_ts_seasonal_diff3
## KPSS Level = 0.040434, Truncation lag parameter = 5, p-value = 0.1
adf_test <- adf.test(daily_ts_seasonal_diff3, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff3, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Annual Seasonality:")
## [1] "ADF Test P-value for Annual Seasonality:"
adf_test$p.value
## [1] 0.01

The time series is stationary after annual seasonal differencing.

Conclusion: The daily time series may have multiple seasonalities (weekly, monthly, and annual).

2. Model Selection & Model Estimation & Model Diagnostics

Model 1: SARIMA

arima_model_1 <- auto.arima(daily_ts, seasonal = TRUE, lambda = 0.37)
## Warning: The time series frequency has been rounded to support seasonal
## differencing.
summary(arima_model_1)
## Series: daily_ts 
## ARIMA(1,0,0)(0,1,0)[365] with drift 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
##          ar1   drift
##       0.2949  0.0270
## s.e.  0.0479  0.0044
## 
## sigma^2 = 508.8:  log likelihood = -1799.3
## AIC=3604.6   AICc=3604.66   BIC=3616.55
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 159.6668 19900.41 10453.97 -3.894772 14.25003 0.4393484 0.02238712

The model identified a significant seasonal pattern occurring every 365 days, reinforcing the choice of annual seasonality. However, SARIMA can capture only a single seasonality.

ARIMA(1,0,0)(0,1,0)[365] with drift : AICc=3604.66

checkresiduals(arima_model_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,0)[365] with drift
## Q* = 358.49, df = 151, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 152

The small p-value from the Ljung-Box test indicates significant autocorrelation in the residuals: the model has not captured much of the temporal structure in the data. The residual ACF also shows many significant lags. This SARIMA model is underfitting.
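
As a sanity check on how to read the Ljung-Box test, a base-R illustration on simulated series (the seed is arbitrary):

```r
# Sanity check on reading the Ljung-Box test (illustrative simulation):
set.seed(123)
rw <- cumsum(rnorm(300))  # random walk: strongly autocorrelated
Box.test(rw, lag = 20, type = "Ljung-Box")$p.value  # essentially 0 -> reject

wn <- rnorm(300)          # white noise: no autocorrelation
Box.test(wn, lag = 20, type = "Ljung-Box")$p.value  # typically large -> fail to reject
```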

Experiment with Different SARIMA Models

library(fpp)
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(TSA)
## Registered S3 methods overwritten by 'TSA':
##   method       from    
##   fitted.Arima forecast
##   plot.Arima   forecast
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
eacf(daily_ts)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x 
## 1 x x o o o o x o o o o  o  o  x 
## 2 x o o x o o x x o o o  o  o  o 
## 3 x o o o o o x o o o o  o  o  o 
## 4 x x x o o o x o o o o  o  o  o 
## 5 x x x x o x x o o o o  o  o  o 
## 6 x x x x x x x o o o o  o  o  o 
## 7 o x x x x x x o x o o  o  o  o

eacf() only suggests orders for the non-seasonal ARMA(p, q) part of the model.

arima_model_2 <- Arima(daily_ts, order=c(2,0,1), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_2)
## Series: daily_ts 
## ARIMA(2,0,1)(0,1,0)[365] with drift 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1    ar2     ma1   drift
##       0.1428  0.033  0.1562  0.0270
## s.e.     NaN    NaN     NaN  0.0043
## 
## sigma^2 = 511:  log likelihood = -1799.26
## AIC=3608.53   AICc=3608.68   BIC=3628.45
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 157.1601 19896.73 10455.72 -3.895559 14.25095 0.4394219 0.0178957

AICc=3608.68

checkresiduals(arima_model_2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,1)(0,1,0)[365] with drift
## Q* = 356.75, df = 149, p-value < 2.2e-16
## 
## Model df: 3.   Total lags used: 152
arima_model_3 <- Arima(daily_ts, order=c(1,1,0), seasonal=c(0,1,0), lambda = 0.37)
summary(arima_model_3)
## Series: daily_ts 
## ARIMA(1,1,0)(0,1,0)[365] 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
##           ar1
##       -0.3421
## s.e.   0.0472
## 
## sigma^2 = 692.5:  log likelihood = -1859.9
## AIC=3723.8   AICc=3723.83   BIC=3731.76
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -715.9327 24210.33 12734.49 -4.529659 16.44589 0.5351916
##                     ACF1
## Training set -0.03113498

AICc=3723.83

arima_model_4 <- Arima(daily_ts, order=c(1,0,1), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_4)
## Series: daily_ts 
## ARIMA(1,0,1)(0,1,0)[365] with drift 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
##          ar1     ma1   drift
##       0.2535  0.0455  0.0270
## s.e.  0.1608  0.1659  0.0043
## 
## sigma^2 = 509.7:  log likelihood = -1799.26
## AIC=3606.52   AICc=3606.63   BIC=3622.46
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 154.3736 19896.67 10456.62 -3.898429 14.25234 0.4394598 0.01773766

AICc=3606.63

arima_model_5 <- Arima(daily_ts, order=c(1,0,2), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_5)
## Series: daily_ts 
## ARIMA(1,0,2)(0,1,0)[365] with drift 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
##          ar1     ma1     ma2   drift
##       0.1787  0.1201  0.0248  0.0270
## s.e.  0.4916  0.4903  0.1535  0.0043
## 
## sigma^2 = 511:  log likelihood = -1799.26
## AIC=3608.51   AICc=3608.66   BIC=3628.43
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 150.8366 19895.72 10460.67 -3.901439 14.25689 0.4396298 0.01765922

AICc=3608.66

arima_model_6 <- Arima(daily_ts, order=c(1,0,0), seasonal=c(0,1,0), lambda = 0.37)
summary(arima_model_6)
## Series: daily_ts 
## ARIMA(1,0,0)(0,1,0)[365] 
## Box Cox transformation: lambda= 0.37 
## 
## Coefficients:
##          ar1
##       0.3993
## s.e.  0.0459
## 
## sigma^2 = 548.4:  log likelihood = -1814.85
## AIC=3633.7   AICc=3633.73   BIC=3641.66
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 4265.372 20349.81 11200.93 0.7656516 14.27421 0.4707407
##                     ACF1
## Training set -0.02986718

AICc=3633.73

Conclusion: the SARIMA model with the lowest AICc is ARIMA(1,0,0)(0,1,0)[365] with drift (AICc = 3604.66).

Examine the Best SARIMA Model (Model Diagnostics): ARIMA(1,0,0)(0,1,0)[365] with drift

Ljung-Box test & Shapiro-Wilk normality test
checkresiduals(arima_model_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,0)[365] with drift
## Q* = 358.49, df = 151, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 152
shapiro.test(arima_model_1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  arima_model_1$residuals
## W = 0.77065, p-value < 2.2e-16

Ljung-Box test: p-value < 2.2e-16 -> residuals are autocorrelated.

Shapiro-Wilk normality test: p-value < 2.2e-16 -> residuals are not normally distributed.

Normal Q-Q Plot
qqnorm(arima_model_1$residuals,main=expression(Normal~~Q-Q~~Plot))
qqline(arima_model_1$residuals)

Conclusion with SARIMA Model:

  1. The best model is ARIMA(1,0,0)(0,1,0)[365] with drift with lowest AICc=3604.66.

  2. But this model’s residuals are autocorrelated and aren’t normally distributed. There are systematic patterns in the data that the model fails to capture.

  3. Therefore, this model is underfitting; a likely reason is that ARIMA/SARIMA models can capture only a single seasonality.

  4. Next, I examine models with multiple predictors and multiple seasonalities.

Model 2: Linear Regression with Weather Variables

final_average_df <- read.csv("final_daily_nyc_df.csv")
head(final_average_df)
##         date     tmpf     dwpf     relh     drct     sknt        p01i     alti
## 1 2022-03-01 40.50000 23.32917 51.52625 156.4286 5.000000 0.000000000 30.13625
## 2 2022-03-02 45.58333 27.49583 50.73917 208.3333 4.636364 0.000000000 29.97792
## 3 2022-03-03 37.93462 15.85000 43.33346 293.5000 8.760000 0.001363636 30.08000
## 4 2022-03-04 29.07391  4.40000 35.18652 205.8333 4.666667 0.000000000 30.50174
## 5 2022-03-05 39.43478 19.26522 44.81957  65.0000 3.272727 0.000000000 30.47130
## 6 2022-03-06 54.00000 46.23684 75.53211 131.4286 4.529412 0.006875000 30.06237
##       mslp      vsby     feel number_of_rides
## 1 1019.704 10.000000 36.81583           57432
## 2 1014.321  9.708333 42.85409           68718
## 3 1018.418  9.884615 31.71600           56795
## 4 1032.178 10.000000 23.36524           51059
## 5 1031.030 10.000000 36.90227           52695
## 6 1016.413  7.684211 53.42237           55948
tail(final_average_df)
##           date     tmpf     dwpf     relh      drct     sknt         p01i
## 757 2024-03-26 44.87027 28.58919 53.35378  57.60000 5.694444 0.0000000000
## 758 2024-03-27 46.51053 39.76842 77.60816  65.58824 2.972973 0.0011764706
## 759 2024-03-28 48.58710 44.85484 86.95145 109.38776 3.066667 0.0465217391
## 760 2024-03-29 46.70833 20.83333 36.32000 304.21053 9.476190 0.0000000000
## 761 2024-03-30 50.04167 25.08333 41.03875 281.87500 7.391304 0.0004761905
## 762 2024-03-31 54.65385 33.07692 46.50577 195.38462 4.238095 0.0020833333
##         alti     mslp      vsby     feel number_of_rides
## 757 30.22973 1022.600  9.972973 41.93108          105245
## 758 30.11184 1019.038  7.631579 44.82135           95913
## 759 29.98113 1013.583  5.387097 47.41683           55955
## 760 29.77167 1007.329 10.000000 42.73909           99427
## 761 29.78792 1007.929 10.000000 47.66042           97784
## 762 29.89192 1011.275  9.557692 54.28192           93274
final_average_df$date <- as.Date(final_average_df$date)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data_matrix <- final_average_df %>%
  select(-date) %>%
  as.matrix()

multivar_ts <- ts(data_matrix, start = c(2022, 60), frequency = 365.25)
str(multivar_ts)
##  Time-Series [1:762, 1:11] from 2022 to 2024: 40.5 45.6 37.9 29.1 39.4 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:11] "tmpf" "dwpf" "relh" "drct" ...
correlations <- cor(multivar_ts[, -11], multivar_ts[, 11])  

print(correlations)
##              [,1]
## tmpf  0.738914694
## dwpf  0.561297552
## relh -0.127238016
## drct -0.253614088
## sknt -0.520515844
## p01i -0.256448227
## alti  0.019352370
## mslp -0.009804981
## vsby  0.335498160
## feel  0.740480305
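
A small helper (illustrative, not part of the original analysis) makes this ranking step explicit on toy data:

```r
# Rank predictors by absolute correlation with a target column
rank_by_cor <- function(df, target) {
  r <- sapply(df[setdiff(names(df), target)], cor, y = df[[target]])
  sort(abs(r), decreasing = TRUE)
}

set.seed(1)
toy <- data.frame(a = 1:50, b = rnorm(50), y = 1:50 + rnorm(50, sd = 5))
rank_by_cor(toy, "y")  # 'a' ranks far above the noise column 'b'
```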

Pick Highly Correlated Regressors

rides <- multivar_ts[, "number_of_rides"]

# Most correlated
tmpf <- multivar_ts[, "tmpf"]
dwpf <- multivar_ts[, "dwpf"]
sknt <- multivar_ts[, "sknt"]
feel <- multivar_ts[, "feel"]

# Moderately correlated 
vsby <- multivar_ts[, "vsby"]
drct <- multivar_ts[, "drct"]
p01i <- multivar_ts[, "p01i"]

# Less correlated
relh <- multivar_ts[, "relh"]
alti <- multivar_ts[, "alti"]
mslp <- multivar_ts[, "mslp"]
plot(rides, xlab = "Time")

plot(tmpf, xlab = "Time")

plot(dwpf, xlab = "Time")

plot(sknt, xlab = "Time")

plot(feel, xlab = "Time")

library(forecast)
linear_model1 <- tslm(rides ~ tmpf + dwpf + sknt + feel, data = multivar_ts)
print(summary(linear_model1))
## 
## Call:
## tslm(formula = rides ~ tmpf + dwpf + sknt + feel, data = multivar_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60233 -12849   -452  13694  57477 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21100.91    5863.66   3.599 0.000341 ***
## tmpf         1715.01     644.38   2.661 0.007944 ** 
## dwpf        -1095.50      93.19 -11.756  < 2e-16 ***
## sknt        -3226.67     420.87  -7.667 5.42e-14 ***
## feel          645.60     578.92   1.115 0.265132    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19100 on 757 degrees of freedom
## Multiple R-squared:  0.637,  Adjusted R-squared:  0.6351 
## F-statistic: 332.1 on 4 and 757 DF,  p-value: < 2.2e-16
print(checkresiduals(linear_model1))

## 
##  Breusch-Godfrey test for serial correlation of order up to 152
## 
## data:  Residuals from Linear regression model
## LM test = 392.44, df = 152, p-value < 2.2e-16

Adjusted R-squared: 0.6351

Significant autocorrelation remains in the residuals; the variable “feel” is insignificant.

library(forecast)
linear_model2 <- tslm(rides ~ tmpf + dwpf + sknt, data = multivar_ts)
print(summary(linear_model2))
## 
## Call:
## tslm(formula = rides ~ tmpf + dwpf + sknt, data = multivar_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60676 -13028   -534  13514  57981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16792.80    4411.86   3.806 0.000152 ***
## tmpf         2423.71     106.53  22.752  < 2e-16 ***
## dwpf        -1083.50      92.58 -11.703  < 2e-16 ***
## sknt        -3381.36     397.42  -8.508  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19110 on 758 degrees of freedom
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.635 
## F-statistic: 442.3 on 3 and 758 DF,  p-value: < 2.2e-16
print(checkresiduals(linear_model2))

## 
##  Breusch-Godfrey test for serial correlation of order up to 152
## 
## data:  Residuals from Linear regression model
## LM test = 394.32, df = 152, p-value < 2.2e-16

Adjusted R-squared: 0.635

library(forecast)
linear_model3 <- tslm(rides ~ tmpf + dwpf + sknt + feel + vsby + drct + p01i, data = multivar_ts)
print(summary(linear_model3))
## 
## Call:
## tslm(formula = rides ~ tmpf + dwpf + sknt + feel + vsby + drct + 
##     p01i, data = multivar_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50836 -11774    -43  12277  48344 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9875.57    6427.92  -1.536    0.125    
## tmpf            672.14     591.95   1.135    0.257    
## dwpf            -62.61     119.70  -0.523    0.601    
## sknt          -2325.53     428.37  -5.429 7.66e-08 ***
## feel            713.55     526.13   1.356    0.175    
## vsby           4576.36     498.37   9.183  < 2e-16 ***
## drct              1.62      10.39   0.156    0.876    
## p01i        -151819.52   33926.87  -4.475 8.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17310 on 754 degrees of freedom
## Multiple R-squared:  0.7031, Adjusted R-squared:  0.7003 
## F-statistic: 255.1 on 7 and 754 DF,  p-value: < 2.2e-16
print(checkresiduals(linear_model3))

## 
##  Breusch-Godfrey test for serial correlation of order up to 152
## 
## data:  Residuals from Linear regression model
## LM test = 449.59, df = 152, p-value < 2.2e-16

Adjusted R-squared: 0.7003

library(forecast)
linear_model4 <- tslm(rides ~ tmpf + dwpf + sknt + feel + vsby + drct + p01i + relh + alti + mslp, data = multivar_ts)
print(summary(linear_model4))
## 
## Call:
## tslm(formula = rides ~ tmpf + dwpf + sknt + feel + vsby + drct + 
##     p01i + relh + alti + mslp, data = multivar_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51450 -11752    426  12247  46994 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.785e+04  1.138e+05   0.421   0.6742    
## tmpf        -5.334e+02  8.649e+02  -0.617   0.5376    
## dwpf         9.573e+02  5.902e+02   1.622   0.1052    
## sknt        -2.090e+03  4.378e+02  -4.775 2.16e-06 ***
## feel         9.321e+02  5.335e+02   1.747   0.0810 .  
## vsby         3.823e+03  6.308e+02   6.060 2.15e-09 ***
## drct        -3.607e-01  1.143e+01  -0.032   0.9748    
## p01i        -1.372e+05  3.445e+04  -3.981 7.52e-05 ***
## relh        -5.351e+02  3.073e+02  -1.741   0.0820 .  
## alti         8.258e+04  4.117e+04   2.006   0.0452 *  
## mslp        -2.446e+03  1.228e+03  -1.992   0.0467 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17260 on 751 degrees of freedom
## Multiple R-squared:  0.706,  Adjusted R-squared:  0.7021 
## F-statistic: 180.3 on 10 and 751 DF,  p-value: < 2.2e-16
print(checkresiduals(linear_model4))

## 
##  Breusch-Godfrey test for serial correlation of order up to 152
## 
## data:  Residuals from Linear regression model
## LM test = 458.5, df = 152, p-value < 2.2e-16

Adjusted R-squared: 0.7021

Conclusion:

  1. Linear regression fails to capture the nonlinear, oscillatory relationships in the data.

  2. The residuals are autocorrelated (Breusch-Godfrey test for serial correlation) and the residual ACF resembles that of the original data, so this model is underfitting.

  3. Use Regression With ARIMA Errors as the next step.

Model 3: Regression With ARIMA Errors

Before Fitting the Model: Apply Seasonal Differencing to Make Regressors Stationary

library(forecast)
# Chose the 7-regressor combination from linear regression with adjusted R-squared 0.7003.
regressors1 <- c('tmpf', 'dwpf', 'sknt', 'feel', 'vsby', 'drct', 'p01i')

# Apply seasonal differencing to each regressor
seasonal_lag <- 365
diffed_regressors1 <- sapply(multivar_ts[, regressors1], function(x) diff(x, lag = seasonal_lag))

rides_aligned <- rides[-(1:seasonal_lag)]
check_stationarity <- function(ts_data) {
  adf_result <- adf.test(ts_data)
  kpss_result <- kpss.test(ts_data)
  return(list(adf_p_value = adf_result$p.value, kpss_p_value = kpss_result$p.value))
}

for (i in 1:ncol(diffed_regressors1)) {
  regressor_name <- colnames(diffed_regressors1)[i]
  regressor_data <- diffed_regressors1[, i]
  
  test_results <- check_stationarity(regressor_data)
  cat("Stationarity tests for:", regressor_name, "\n")
  cat("ADF test p-value:", test_results$adf_p_value, "\n")
  cat("KPSS test p-value:", test_results$kpss_p_value, "\n\n")
}
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: tmpf 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: dwpf 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: sknt 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: feel 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: vsby 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: drct 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value

## Warning in adf.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: p01i 
## ADF test p-value: 0.01 
## KPSS test p-value: 0.1
reg_arima_1 <- auto.arima(rides_aligned, xreg = diffed_regressors1, lambda = "auto", seasonal = TRUE)
print(summary(reg_arima_1))
## Series: rides_aligned 
## Regression with ARIMA(0,1,0) errors 
## Box Cox transformation: lambda= 1.568598 
## 
## Coefficients:
##           drift       tmpf      dwpf       sknt      feel       vsby       drct
##        13394.28  -40362.58  67617.50  -650519.8  122065.2  2102166.4  15233.571
## s.e.  810780.02  672445.05  94865.52   337885.8  569340.3   378790.3   7433.018
##            p01i
##       -51412376
## s.e.   33841961
## 
## sigma^2 = 1.833e+14:  log likelihood = -7060.59
## AIC=14139.19   AICc=14139.66   BIC=14175.02
## 
## Training set error measures:
##                    ME    RMSE      MAE       MPE     MAPE     MASE       ACF1
## Training set 198.9138 20761.6 15033.97 -3.119669 20.06502 0.902888 -0.2406174
print(checkresiduals(reg_arima_1))

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,1,0) errors
## Q* = 62.261, df = 10, p-value = 1.35e-09
## 
## Model df: 0.   Total lags used: 10

AICc = 14139.66. The Ljung-Box p-value of 1.35e-09 indicates that the residuals are autocorrelated.

The residual ACF shows significant spikes at lags 1, 7, and 21, indicating that weekly seasonality remains in the residuals. Therefore, I introduce a multi-seasonality model next.

Model 4: Dynamic Harmonic Regression

Periodogram

temp <- periodogram(daily_ts)

max_freq1 <- temp$freq[which.max(temp$spec)]
seasonality1 <- 1/max_freq1
print("First Seasonality: ")
## [1] "First Seasonality: "
print(seasonality1)
## [1] 384
modified_spec <- temp$spec
modified_spec[which.max(temp$spec)] <- -Inf 
max_freq2 <- temp$freq[which.max(modified_spec)]
seasonality2 <- 1/max_freq2
print("Second Seasonality: ")
## [1] "Second Seasonality: "
print(seasonality2)
## [1] 6.981818
# Extract top 5 frequencies based on highest spectral power
top_freqs <- temp$freq[order(temp$spec, decreasing = TRUE)[1:5]]
top_specs <- sort(temp$spec, decreasing = TRUE)[1:5]

seasonalities <- 1 / top_freqs

results <- data.frame(Frequency = top_freqs, Spectral_Power = top_specs, Seasonality = seasonalities)
print(results)
##     Frequency Spectral_Power Seasonality
## 1 0.002604167   371461518563  384.000000
## 2 0.143229167    32730774449    6.981818
## 3 0.001302083    21049688114  768.000000
## 4 0.003906250    14254414458  256.000000
## 5 0.007812500    11915750567  128.000000

The dominant period of 384 days is the periodogram's closest resolvable approximation to an annual cycle, and the second period is about 7 days, so the data has both annual and weekly seasonality.
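
The Fourier regressors used below can be written out directly. A base-R sketch of the sin/cos pairs that forecast::fourier() generates for a period m and order K (the column names mimic the S/C convention in the model summaries; this is a simplified illustration, not the package implementation):

```r
# Fourier terms for period m, order K: sin(2*pi*k*t/m), cos(2*pi*k*t/m)
fourier_terms <- function(t, m, K) {
  X <- do.call(cbind, lapply(1:K, function(k)
    cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m))))
  colnames(X) <- as.vector(rbind(paste0("S", 1:K, "-", m),
                                 paste0("C", 1:K, "-", m)))
  X
}

X <- fourier_terms(1:14, m = 7, K = 2)  # weekly terms: S1-7, C1-7, S2-7, C2-7
dim(X)           # 14 x 4
X[1, ] - X[8, ]  # ~0: the columns repeat with period 7
```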

Create Multi-seasonal ts object

library(forecast)

final_average_df <- read.csv("final_daily_nyc_df.csv")
daily_msts <- msts(final_average_df$number_of_rides, seasonal.periods=c(365.25, 7), start = c(2022, 60))

summary(daily_msts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9851   69159   97296   93831  119554  161586

Box-Cox Transformation

library(forecast)
lambda <- BoxCox.lambda(daily_msts)
daily_msts_boxcox <- BoxCox(daily_msts, lambda)

plot(daily_msts_boxcox, main = paste("Box-Cox Transformed Time Series Data, Lambda =", round(lambda, 2)), xlab = "Time", ylab = "Transformed Bike Rides")

Decomposition with multiple seasonal periods

daily_msts %>% mstl() %>%
  autoplot()

Experiment with hyperparameter ‘K’

xreg <- fourier(daily_msts, K=c(3, 4))
dh_regression1 <- auto.arima(daily_msts, xreg=xreg, seasonal=FALSE, lambda = 'auto')
summary(dh_regression1)
## Series: daily_msts 
## Regression with ARIMA(1,1,1) errors 
## Box Cox transformation: lambda= 0.3748346 
## 
## Coefficients:
##          ar1      ma1    S1-7     C1-7    S2-7    C2-7     S3-7    C3-7
##       0.3549  -0.9854  6.6140  -5.1123  1.7794  0.8221  -0.5086  0.9097
## s.e.  0.0345   0.0050  0.9977   0.9980  0.7278  0.7287   0.6205  0.6217
##        S1-365    C1-365  S2-365  C2-365   S3-365  C3-365   S4-365  C4-365
##       15.6286  -23.7366  5.6410  0.0743  -0.9852  2.5459  -3.8887  0.4182
## s.e.   1.8379    1.6996  1.4281  1.4028   1.3380  1.3405   1.3083  1.3137
## 
## sigma^2 = 268.7:  log likelihood = -3201.53
## AIC=6437.06   AICc=6437.88   BIC=6515.84
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
## Training set 2837.729 18139.73 13687.02 -3.34914 19.44719 0.5752233 -0.02917148
checkresiduals(dh_regression1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,1,1) errors
## Q* = 170.23, df = 150, p-value = 0.1236
## 
## Model df: 2.   Total lags used: 152
newharmonics <- fourier(daily_msts, K=c(3, 4), h=365)  # forecast regressors must use the same K as the fitted model
fc <- forecast(dh_regression1, xreg=newharmonics)
autoplot(fc)

  1. K=c(3, 4): Passed Ljung-Box test, AICc=6437.88, MAE=13687.02

  2. K=c(2, 5): Passed Ljung-Box test, AICc=6433.83, MAE=13538.87

Selecting K and ARIMA model by minimizing AICc

library(forecast)

bestfit <- list(aicc = Inf)
maxK_weekly <- 3   # at most floor(7/2) = 3 harmonics for the weekly period
maxK_annual <- 10

for (i in 1:maxK_weekly) {
  for (j in 1:maxK_annual) {
    xreg <- fourier(daily_msts, K=c(i, j))
    fit <- auto.arima(daily_msts, xreg=xreg, seasonal=FALSE, lambda = 'auto')
    
    if (fit$aicc < bestfit$aicc) {
      bestfit <- fit
      best_K <- c(i, j)  
    }
  }
}
summary(bestfit)
## Series: daily_msts 
## Regression with ARIMA(2,1,3) errors 
## Box Cox transformation: lambda= 0.3748346 
## 
## Coefficients:
##           ar1     ar2      ma1      ma2     ma3    S1-7     C1-7    S2-7
##       -0.4713  0.4633  -0.1709  -0.9754  0.1689  6.6281  -5.1119  1.7790
## s.e.   0.1152  0.1031   0.1288   0.0131  0.1228  0.9465   0.9467  0.7273
##         C2-7   S1-365    C1-365  S2-365  C2-365   S3-365  C3-365   S4-365
##       0.8207  15.5534  -23.7351  5.5801  0.0903  -1.0444  2.5825  -3.9401
## s.e.  0.7282   1.8766    1.7340  1.4788  1.4529   1.3915  1.3940   1.3624
##       C4-365  S5-365   C5-365
##       0.4771  0.9782  -2.3900
## s.e.  1.3677  1.3518   1.3493
## 
## sigma^2 = 266:  log likelihood = -3196.35
## AIC=6432.7   AICc=6433.83   BIC=6525.39
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2793.025 17984.71 13538.87 -3.313332 19.23899 0.5689971
##                      ACF1
## Training set -0.008145507
checkresiduals(bestfit)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,1,3) errors
## Q* = 158.07, df = 147, p-value = 0.2518
## 
## Model df: 5.   Total lags used: 152
print(best_K)
## [1] 2 5
shapiro.test(bestfit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bestfit$residuals
## W = 0.90564, p-value < 2.2e-16

Shapiro-Wilk normality test: p-value < 2.2e-16, so the residuals are not normally distributed.

qqnorm(bestfit$residuals,main=expression(Normal~~Q-Q~~Plot))
qqline(bestfit$residuals)

Best K values: 2 weekly and 5 annual harmonics, with ARIMA(2,1,3) errors.

AICc = 6433.83, MAE = 13538.87.

This model passed the Ljung-Box test (p-value = 0.2518) but failed the residual normality test.
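
Because the residuals fail the normality test, the normal-theory prediction intervals may be miscalibrated in the tails. One possible mitigation (a sketch, assuming `bestfit` and `daily_msts` from above) is to simulate future paths by resampling the residuals via the `bootstrap` argument of `forecast()`:

```r
# Bootstrapped prediction intervals: future sample paths are built by
# resampling model residuals instead of assuming Gaussian errors.
boot_xreg <- fourier(daily_msts, K = c(2, 5), h = 30)
fc_boot <- forecast(bestfit, xreg = boot_xreg,
                    bootstrap = TRUE, npaths = 1000,
                    level = c(80, 95))
autoplot(fc_boot)
```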

Model 5: TBATS Model

library(forecast)
tbats1 <- tbats(daily_msts)
print(tbats1)
## TBATS(1, {1,2}, -, {<7,2>, <365.25,5>})
## 
## Call: tbats(y = daily_msts)
## 
## Parameters
##   Alpha: 0.01276081
##   Gamma-1 Values: -0.00147551 -0.003423304
##   Gamma-2 Values: 0.003033274 1.922459e-06
##   AR coefficients: 0.580015
##   MA coefficients: -0.236126 -0.071849
## 
## Seed States:
##              [,1]
##  [1,]  82565.4798
##  [2,]   1754.3544
##  [3,]   2015.1356
##  [4,]   9653.5158
##  [5,]   -583.5445
##  [6,] -29223.8878
##  [7,]    732.5216
##  [8,]   2022.6759
##  [9,]    101.6095
## [10,]  -1993.4930
## [11,]  18893.1044
## [12,]   6275.8671
## [13,]  -1684.3807
## [14,]  -4156.8555
## [15,]   1795.4921
## [16,]      0.0000
## [17,]      0.0000
## [18,]      0.0000
## 
## Sigma: 17684.81
## AIC: 20014.01
checkresiduals(tbats1)

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS
## Q* = 162.44, df = 152, p-value = 0.2664
## 
## Model df: 0.   Total lags used: 152
fc1 <- forecast(tbats1, h=365)
autoplot(fc1)

Model: TBATS(1, {1,2}, -, {<7,2>, <365.25,5>}).

TBATS selected the same harmonic orders (2 weekly, 5 annual) as the Dynamic Harmonic Regression, but a different ARMA error structure {1,2} (AR(1), MA(2)). This model passed the Ljung-Box test. AIC = 20014.01; note this value is not directly comparable to the harmonic regression's AICc, since TBATS here uses lambda = 1 (no transformation) while the regression was fit to Box-Cox-transformed data (lambda ≈ 0.37).

Model 6: Intervention Analysis (Covid)

inference <- read.csv("intervention_analysis_data.csv")
inference_ts <- ts(inference$number_of_rides, start = c(2018, 1), frequency = 365.25)
print(head(inference_ts))
## Time Series:
## Start = 2018 
## End = 2018.01368925394 
## Frequency = 365.25 
## [1]  5500 18818 24299  1922  4972  4295
print(length(inference_ts))
## [1] 2281
plot(inference_ts, xlab = "Time", ylab = "Number of Rides", main = "Daily Bike Rides from January 2018 to March 2024")

Conclusion: based on visual inspection of the full 2018–2024 series, there is no persistent level shift attributable to Covid-19, so no intervention effect needs to be modeled.
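
If a candidate intervention date were suspected, one way to check it formally (a sketch; the step date below is purely illustrative, roughly mid-March 2020) is to add a step dummy as a regressor and inspect its coefficient:

```r
library(forecast)

# Illustrative intervention check: a step dummy that switches on at a
# hypothetical candidate date (here about mid-March 2020).
step_idx   <- which(time(inference_ts) >= 2020.2)[1]
step_dummy <- as.numeric(seq_along(inference_ts) >= step_idx)
fit_int <- auto.arima(inference_ts, xreg = step_dummy, seasonal = FALSE)
summary(fit_int)  # a step coefficient near zero supports "no intervention effect"
```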

3. Model Evaluation of Forecast Accuracy via Cross-Validation

Dynamic Harmonic Regression (expanding window with a 7-day forecast horizon for cross-validation)

library(forecast)

model_1 <- function(x, h) {
  
  if (length(x) >= 365 && length(x) + h <= length(daily_msts)) {
    # Generate Fourier terms for the current training subset x
    # (K = c(2, 5) as selected by the AICc search above)
    x_fourier <- fourier(x, K=c(2, 5))

    # Fit ARIMA(2,1,3) with the fixed orders from the model selection step
    fit <- Arima(x, order=c(2, 1, 3), seasonal=c(0, 0, 0), xreg=x_fourier)

    # Generate matching Fourier terms for the h-step forecast
    future_fourier <- fourier(x, K=c(2, 5), h=h)

    # Forecast with the fitted model and the future Fourier terms
    fc <- forecast(fit, xreg=future_fourier)
    return(fc)
  } else {
    print("Insufficient data for fitting and forecasting.")
    return(forecast(rep(NA, h), h=h))
  }
}

harmo_error_exp <- tsCV(daily_msts, model_1, h=7, initial=365)
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
harmo_error_exp <- na.omit(harmo_error_exp)
# print(harmo_error_exp)
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
## 
##     accuracy
# Mean Error
mean_error <- mean(harmo_error_exp, na.rm = TRUE)
print(paste("Mean Error:", mean_error))
## [1] "Mean Error: 2946.60696016812"
# Standard Deviation of Error
std_error <- sd(harmo_error_exp, na.rm = TRUE)
print(paste("Standard Deviation of Error:", std_error))
## [1] "Standard Deviation of Error: 21705.5838029752"
# Mean Absolute Error (MAE)
mae_value <- mae(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Mean Absolute Error (MAE):", mae_value))
## [1] "Mean Absolute Error (MAE): 16245.9762863509"
# Root Mean Squared Error (RMSE)
rmse_value <- rmse(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Root Mean Squared Error (RMSE):", rmse_value))
## [1] "Root Mean Squared Error (RMSE): 21900.7370845512"
# Mean Squared Error (MSE)
mse_value <- mse(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Mean Squared Error (MSE):", mse_value))
## [1] "Mean Squared Error (MSE): 479642284.846634"
# Median Absolute Error (MedAE)
medae_value <- median(abs(harmo_error_exp), na.rm = TRUE)
print(paste("Median Absolute Error (MedAE):", medae_value))
## [1] "Median Absolute Error (MedAE): 12061.0429664374"

4. Future Forecast by Dynamic Harmonic Regression

future_fourier <- fourier(daily_msts, K=c(2, 5), h=30)
harmo_fc <- forecast(bestfit, xreg=future_fourier, level = c(80, 95))
print(harmo_fc)
##           Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2024.2493       94213.34 69652.06 123577.1 58475.24 141181.1
## 2024.2521       97550.66 71010.82 129554.9 59031.52 148846.8
## 2024.2548      104225.38 76254.72 137860.2 63595.91 158099.0
## 2024.2575      100769.97 73266.27 133958.6 60859.76 153972.9
## 2024.2603      100932.38 73385.88 134172.3 60959.93 154217.3
## 2024.2630       86476.82 61578.64 116869.5 50470.33 135330.1
## 2024.2658       84134.71 59685.67 114041.1 48799.52 132229.7
## 2024.2685       88985.74 63588.70 119927.2 52236.67 138698.5
## 2024.2712      102924.80 74955.68 136644.4 62328.34 156967.2
## 2024.2740      103952.29 75780.45 137897.2 63054.82 158348.6
## 2024.2767      108377.15 79417.46 143170.7 66300.18 164095.2
## 2024.2795      102083.16 74231.07 135689.7 61666.52 155955.0
## 2024.2822       93597.88 67301.54 125525.2 55508.83 144853.1
## 2024.2849       85599.31 60804.01 115907.4 49756.03 134332.2
## 2024.2877       96036.68 69273.12 128474.6 57250.88 148090.2
## 2024.2904      104844.03 76465.28 139029.6 63643.12 159622.7
## 2024.2932      111416.78 81872.90 146857.1 68471.38 168149.6
## 2024.2959      110557.35 81150.61 145855.1 67819.13 167070.4
## 2024.2986      109177.69 80012.76 144215.7 66801.76 165286.4
## 2024.3014       95747.09 68997.89 128184.8 56988.20 147806.9
## 2024.3041       91728.39 65727.88 123357.3 54089.31 142527.7
## 2024.3068       98372.55 71126.93 131351.3 58872.79 151277.5
## 2024.3096      111554.64 81943.24 147083.4 68513.76 168431.9
## 2024.3123      114139.79 84065.15 150171.0 70406.55 171801.1
## 2024.3151      117278.81 86658.46 153897.2 72728.28 175854.6
## 2024.3178      112021.75 82303.97 147673.9 68824.73 169094.9
## 2024.3205      101728.56 73844.09 135406.8 61276.51 155727.7
## 2024.3233       94417.15 67868.15 126657.1 55964.22 146176.5
## 2024.3260      104308.16 75945.33 138506.8 63141.81 159120.1
## 2024.3288      114742.01 84523.87 150941.5 70798.81 172671.2
plot(harmo_fc)

Compare forecast values with actual bike ride demand

test_set <- read.csv("24April_test_rides.csv")
fc_values <- harmo_fc$mean
actual_values <- test_set$number_of_rides

comparison_df <- data.frame(
  Actual = actual_values,
  Forecasted = fc_values
)
print(comparison_df)
##    Actual Forecasted
## 1   89230   94213.34
## 2   41395   97550.66
## 3   27031  104225.38
## 4   99708  100769.97
## 5  102359  100932.38
## 6   97087   86476.82
## 7   96966   84134.71
## 8  113957   88985.74
## 9  140252  102924.80
## 10 126407  103952.29
## 11  89245  108377.15
## 12 107309  102083.16
## 13  91502   93597.88
## 14 118128   85599.31
## 15 142162   96036.68
## 16 146951  104844.03
## 17 114294  111416.78
## 18  90995  110557.35
## 19 118962  109177.69
## 20 148800   95747.09
## 21  94132   91728.39
## 22 112278   98372.55
## 23 117106  111554.64
## 24 124140  114139.79
## 25 118147  117278.81
## 26 122802  112021.75
## 27 107860  101728.56
## 28 134159   94417.15
## 29 132321  104308.16
## 30 126078  114742.01
fc_values <- as.numeric(fc_values)

plot(actual_values, type = 'o', pch = 19, col = 'blue', ylim = c(0, max(c(actual_values, fc_values))),
     xlab = "Days", ylab = "Number of Rides", main = "Comparison of Actual vs Forecasted Rides for April 2024",
     xaxt = 'n')

lines(fc_values, type = 'o', pch = 19, col = 'red')

legend("topleft", legend = c("Actual", "Forecasted"), col = c("blue", "red"), pch = 19, lty = 1, cex = 0.8)

axis(1, at = 1:30, labels = 1:30)
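
The visual comparison can also be summarized numerically. A short sketch (assuming `actual_values` and `fc_values` from above) computes hold-out error metrics for the April 2024 forecast, mirroring those used in cross-validation:

```r
# Hold-out accuracy of the 30-day forecast against actual April 2024 rides.
err  <- actual_values - fc_values
mae  <- mean(abs(err))
rmse <- sqrt(mean(err^2))
mape <- mean(abs(err / actual_values)) * 100
print(c(MAE = mae, RMSE = rmse, MAPE = mape))
```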

Observations:

The model captured the general trend in the data well, especially from day 21 to day 27. The forecasts for days 2 and 3 are relatively poor because actual bike demand was unusually low on those two days.

5. Future Work

  1. I used about two years of data for this project. Since there is no intervention effect from Covid, I will expand the time horizon to several years so the models can better learn the annual patterns in the data.

  2. The current best model, Dynamic Harmonic Regression, failed the residual normality test. I will experiment with more flexible models, such as neural networks, to address this.